
KEP-5677: DRA Resource Availability Visibility#5749

Merged
k8s-ci-robot merged 10 commits into kubernetes:master from
nmn3m:kep-5677-dra-resource-availability-visibility
Feb 10, 2026

Conversation

@nmn3m
Member

@nmn3m nmn3m commented Dec 23, 2025

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Dec 23, 2025
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 23, 2025
@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from ca95081 to d9ac678 Compare December 29, 2025 23:26
@nmn3m nmn3m marked this pull request as ready for review December 29, 2025 23:31
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 29, 2025
@k8s-ci-robot
Contributor

@nmn3m: GitHub didn't allow me to request PR reviews from the following users: kubernetes/sig-scheduling, kubernetes/sig-node, kubernetes/sig-cli, kubernetes/wg-device-management.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @kubernetes/sig-scheduling
/cc @kubernetes/sig-node
/cc @kubernetes/sig-cli
/cc @kubernetes/wg-device-management

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from d9ac678 to 495b6cb Compare December 29, 2025 23:36
@nmn3m
Member Author

nmn3m commented Dec 29, 2025

/cc @johnbelamaric
/cc @pohly

@nmn3m
Member Author

nmn3m commented Dec 29, 2025

/cc @kubernetes/sig-cli-kubectl-maintainers

@mortent
Member

mortent commented Jan 6, 2026

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 6, 2026
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jan 7, 2026
Member

@johnbelamaric johnbelamaric left a comment

First pass, this is looking really really good to me so far

@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from 495b6cb to fdbf949 Compare January 14, 2026 22:37
@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from dd8b15d to f26f289 Compare February 10, 2026 14:18
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2026
@liggitt
Member

liggitt commented Feb 10, 2026

> @liggitt WDYT?
>
> An always available, in-tree approach is preferable, but I am not sure there's a great option for that. Here's what I could think of:
>
>   1. A "request" is made by creating a ResourcePool object, or maybe it's even called something like "ResourcePoolStatusRequest". A controller runs in KCM that sees that request, makes the calculation, and writes the result to the object's status, where it can be observed by the user. It is a one-time operation with a timestamp; to recalculate, the user has to delete and recreate the object. The object should probably be cluster scoped.
>   2. A specialized API endpoint built into the API server. I suspect this is a no-go. Jordan, is there any precedent for that?
>   3. A specialized API endpoint in KCM that is then exposed via an aggregated API configuration. Jordan, any precedent?
>
> The first one seems promising if we want to do this in-tree. I actually think it's fine, and there is precedent for similar "imperative operations through declarative APIs" with things like CSR, and even the way device taints with the "None" effect work. It also gives us the ability to control permissions on the object.
>
> For out-of-tree (could be in k-sigs), we could implement JUST a kubectl plugin and rely on user permissions to start, and add an aggregated API server later if we see the need.
>
> The advantage of in-tree: always available and in sync with Kubernetes releases, so all users can depend on it. Disadvantage: locked to the Kubernetes release cycle.
>
> The advantage of out-of-tree: we can implement it independently of the release cycle.
>
> My preference: the first in-tree option.

Of those three options, the first seems the best to me as well. If it's an object we intend to be created, waited for, read, and then deleted only to get a view of status, making it separate from a general ResourcePool type is a good idea.

We should also define the behavior when multiple of these exist at the same time for the same pool (e.g. the controller calculates once and then fills all of them).

Member

@liggitt liggitt left a comment

some questions about the filtering / scale / limit bits, but those seem ok to pin down in implementation review as well

Comment on lines +547 to +548
| `driver` | Filter by driver name (optional) |
| `poolName` | Filter by pool name (optional, requires driver) |
Member

making both of these optional means status is very unbounded ... should at least driver be required?

Member Author

I think that would not be a problem.
@johnbelamaric WDYT?

Member

sure, that's fine

|-------|-------------|
| `driver` | Filter by driver name (optional) |
| `poolName` | Filter by pool name (optional, requires driver) |
| `limit` | Max pools to return (default: 100, max: 1000) |
Member

I expected a request structure that would not result in unbounded status that would require limits like this

Member

I'm also not sure where the 100 / 1000 limits came from ... with ResourceSlice, we've been really specific about the maximum size possible if all fields / lists are at their maximum size, to be sure the resulting resource could actually be persisted. Was that done here?

Member Author

No, the rigorous max-size calculation was not done. The 100/1000 numbers were chosen based on patterns in other K8s APIs, not from first principles. If we keep a limit field, we should do the proper calculation.

Alternatively, if we make driver required, the response becomes naturally bounded and the limit field may not be needed.

@liggitt WDYT?

Member

there could still be a LOT for a given driver. I think we should have a limit. If we can calculate that now that's great, but we can also defer it to implementation time.

Member

if we need a limit, make sure it is principled, and consider whether we need to make it user-specifiable, and consider whether the use cases we intend will break if a truncated response is received (e.g. an autoscaler couldn't use truncated info, right?)

@mrunalp
Contributor

mrunalp commented Feb 10, 2026

/approve
for sig-node. Thanks for evaluating the various alternatives in reaching this design!
/hold

@johnbelamaric Please cancel the hold once ready. Thanks!

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kannon92, mrunalp, nmn3m

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2026
…ased on size calculations

Signed-off-by: Nour <nurmn3m@gmail.com>
@nmn3m nmn3m force-pushed the kep-5677-dra-resource-availability-visibility branch from f26f289 to a3d1151 Compare February 10, 2026 19:08
@johnbelamaric
Member

/hold cancel

Thank you!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2026
@johnbelamaric
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2026
@k8s-ci-robot k8s-ci-robot merged commit f48c046 into kubernetes:master Feb 10, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 10, 2026
@liggitt
Member

liggitt commented Feb 12, 2026

(post-merge note ... we might want to consider automatically deleting these via the controller after a fixed time period after creation / population ... we do that with other resources like certificate requests, etc ... can discuss during implementation)


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Projects

Archived in project
Status: ✅ Done